I first loaded required packages for data munging, visualization, and analysis (these are largely Hadley Wickham libraries, plus some Bioconductor tools).
rm(list = ls())
cleanup <- gc(verbose = FALSE)
# Load libraries I'll need here
library(edgeR)
library(limma)
library(biomaRt)
library(ggfortify)
library(sva)
# Load my go-to libraries
library(dplyr)
library(ggplot2)
library(ggthemes)
library(stringr)
library(readr)
library(readxl)
library(reshape2)
# Packages for R markdown stuff
library(knitr)
library(shiny)
Functions for plotting metrics are contained in metric_qc_functions.R.
source("R/metric_qc_functions.R")
This is a function written by Elizabeth Whalen (shared by Michael Mason) that might come in handy with some steps of the analysis. I modified the function slightly, such that library sizes are updated and normalization factors are calculated after filtering genes. I also added the option to input gene annotation information.
# Function to build a DGEList object, filter genes by keeping only those with
# at least 'filterCount' counts per million in at least 'filterPercentage' of
# samples, and compute normalization factors from the filtered library sizes
setUpDGEList <- function(countData, geneData = NULL,
                         filterCount = 1, filterPercentage = 0.1) {
    d <- DGEList(counts = countData, genes = geneData)
    # d <- calcNormFactors(d) # moved further down
    # Filter all genes that do not have at least 'filterCount' counts per
    # million in at least 'filterPercentage' (proportion) of libraries
    keepRows <- rowSums(round(cpm(d$counts)) >= filterCount) >=
        filterPercentage * ncol(countData)
    print(table(keepRows))
    curDGE <- d[keepRows, ]
    # James: I've added this change so that library sizes and normalization
    # factors will always be updated/calculated after filtering genes
    # Reset library sizes
    curDGE$samples$lib.size <- colSums(curDGE$counts)
    # Calculate normalization factors (effective library size =
    # lib.size * norm.factor)
    curDGE <- calcNormFactors(curDGE)
    return(curDGE)
}
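As a sanity check of the filtering rule, here's a minimal base-R sketch on a toy count matrix, with CPM computed by hand so it runs without edgeR (and ignoring the `round()` and TMM normalization factors in the real function; all counts below are invented):

```r
# Toy count matrix: 5 "genes" x 4 libraries; the 'rest' row pads each library
# to exactly 1e6 reads, so that CPM equals the raw count (easy to eyeball)
counts <- rbind(
  geneA = c(     2,      0,      1,      0),  # essentially unexpressed
  geneB = c(  5000,   4500,   6000,   5500),  # consistently expressed
  geneC = c(    12,      0,      3,      1),  # expressed in one library
  geneD = c(   800,    900,    850,    950),  # consistently expressed
  rest  = c(994186, 994600, 993146, 993549)   # remainder of each library
)
stopifnot(all(colSums(counts) == 1e6))

# Hand-rolled CPM: scale each library (column) to counts per million
cpmMat <- t(t(counts) / colSums(counts)) * 1e6

# Same rule as setUpDGEList(filterCount = 10, filterPercentage = 0.20):
# keep genes with CPM >= 10 in at least 20% of libraries
keepRows <- rowSums(cpmMat >= 10) >= 0.20 * ncol(counts)
keepRows
# geneA is dropped; geneC survives because one library is >= 20% of four
```

Note that with only four libraries, a single library clearing the CPM threshold is enough to keep a gene at `filterPercentage = 0.20`, which is why geneC passes.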
Next, I read counts and metrics data for the project into R, along with sample annotation for project libraries.
# Read CSV file with read counts
countFile <- "data/HMMF2ADXX_combined_counts.csv"
countDat <- read_csv(countFile) # 37991 obs. of 18 variables
# str(countDat)
# Read CSV file with RNAseq/alignment metrics
metricFile <- "data/HMMF2ADXX_combined_metrics.csv"
metricDat <- read_csv(metricFile) # 16 obs. of 71 variables
# str(metricDat)
# Read XLSX file with sample annotation
designFile <- "data/JMD119 Sample Information .xlsx"
designDat <- read_excel(designFile, skip = 1) # 36 obs. of 18 variables
# str(designDat)
I needed to do a bit of cleaning/formatting with variable names (column headers) to make life easier and avoid breaking downstream functions.
# Separate gene counts and gene symbols into separate objects; reformat
# variable names in countDat to only include libID
geneDat <- data_frame(ensemblID = countDat$geneName)
countDat <- countDat %>%
    select(-geneName)
names(countDat) <- names(countDat) %>%
    str_extract("lib[0-9]+")
# Reformat variable names in metrics data frame
names(metricDat) <- names(metricDat) %>%
    str_to_lower() %>%      # change variable names to lower case
    make.unique(sep = "_")  # de-dup variable names
names(metricDat)[1] <- "lib_id" # reformat libID variable name
# Reformat row names in metrics data frame
metricDat <- metricDat %>%
    mutate(lib_id = str_extract(lib_id, "lib[0-9]+"))
# Reformat variable names in design data frame
names(designDat) <- names(designDat) %>%
    str_replace_all(" +", "_") %>%       # replace spaces with underscores
    str_replace_all("#", "num") %>%      # replace '#' with 'num'
    str_replace_all("/", "_per_") %>%    # replace '/' with '_per_'
    str_replace_all("(\\(|\\))", "") %>% # remove parentheses
    str_to_lower() %>%                   # change to lower case
    str_replace("(?<=(lib))[a-z]+", "") %>% # shorten 'library' to 'lib'
    make.unique(sep = "_")               # de-dup variable names
# Remove empty rows from design data frame
designDat <- designDat %>%
    filter(!is.na(lib_id))
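As a sketch of what the header cleaning accomplishes, here is the same sequence of substitutions in base R on a few invented headers (chosen to produce names like `lib_id` and `conc_ng_per_ul` that are used later in the analysis; the real design file's headers aren't shown here):

```r
# Hypothetical messy headers of the kind the cleaning pipeline targets
raw <- c("Lib ID", "Sample Name", "Conc (ng/ul)", "Cell #")

clean <- raw
clean <- gsub(" +", "_", clean)   # spaces -> underscores
clean <- gsub("#", "num", clean)  # '#' -> 'num'
clean <- gsub("/", "_per_", clean) # '/' -> '_per_'
clean <- gsub("[()]", "", clean)  # drop parentheses
clean <- tolower(clean)           # lower case
clean
# "lib_id" "sample_name" "conc_ng_per_ul" "cell_num"
```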
I created a new object to store the salient information about groups in the study I want to compare.
groupDat <- designDat %>%
    # extract knockout status (WT or BCAP) and HSC population (long- or
    # short-term); combine into a single group variable
    mutate(koStatus = as.factor(tolower(str_extract(sample_name, "WT|BCAP"))),
           hscPop = as.factor(tolower(str_extract(hsc_population,
                                                  "Long|Short"))),
           group = as.factor(str_c(koStatus, hscPop, sep = "_"))) %>%
    select(libID = lib_id,
           koStatus, hscPop, group)
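The `str_extract()` calls above pull group labels out of free-text annotation fields; the same idea can be sketched in base R with `regexpr()`/`regmatches()` on invented sample names (the real `sample_name` values aren't shown here):

```r
# Invented sample names of the kind str_extract(sample_name, "WT|BCAP") targets
sampleNames <- c("WT LT-HSC rep1", "BCAP KO LT-HSC rep2")

# Base-R equivalent of str_extract(); regexpr() finds the first match
m <- regexpr("WT|BCAP", sampleNames)
koStatus <- tolower(regmatches(sampleNames, m))
koStatus
# "wt" "bcap"
```

One caveat: `regmatches()` silently drops non-matching elements, whereas `str_extract()` returns `NA` for them, which is safer inside a `mutate()`.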
For reference, here are the relevant groups in the data (stored in groupDat):
| libID | koStatus | hscPop | group |
|---|---|---|---|
| lib7418 | wt | long | wt_long |
| lib7419 | wt | long | wt_long |
| lib7420 | wt | long | wt_long |
| lib7421 | wt | long | wt_long |
| lib7422 | bcap | long | bcap_long |
| lib7423 | bcap | long | bcap_long |
| lib7424 | bcap | long | bcap_long |
| lib7425 | bcap | long | bcap_long |
| lib7426 | wt | short | wt_short |
| lib7427 | wt | short | wt_short |
| lib7428 | wt | short | wt_short |
| lib7429 | wt | short | wt_short |
| lib7430 | bcap | short | bcap_short |
| lib7431 | bcap | short | bcap_short |
| lib7432 | bcap | short | bcap_short |
| lib7433 | bcap | short | bcap_short |
Next, I looked at a few standard metrics to see whether any libraries should be excluded due to quality reasons.
# Pull out and format the subset of metrics to plot
metricSummary <- metricDat %>%
    mutate(percentDuplication = unpaired_read_duplicates /
               unpaired_reads_examined) %>%
    select(libID = lib_id,
           medianCVcoverage = median_cv_coverage,
           fastqTotalReads = fastq_total_reads,
           percentAligned = mapped_reads_w_dups,
           percentDuplication)
Cutoffs are set by default to standard values used in the Bioinformatics Core for three metrics; a library is considered to have 'failed' QC if it falls on the wrong side of any of these cutoffs. I can also adjust slider bars to look at different QC cutoffs (red lines) for the x- and y-axes; dashed lines indicate outlier limits (1.5 * IQR).
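The dashed outlier limits follow the usual Tukey rule (1.5 * IQR beyond the quartiles); here is a minimal base-R sketch on made-up total-read counts (in millions), not the project's actual values:

```r
# Made-up total-read counts (millions) for illustration
totalReads <- c(24.1, 22.8, 25.3, 13.0, 26.0, 23.7, 24.9, 25.5)

q <- quantile(totalReads, c(0.25, 0.75))
iqr <- diff(q)  # same value as IQR(totalReads)
limits <- c(lower = q[[1]] - 1.5 * iqr, upper = q[[2]] + 1.5 * iqr)
limits

# Flag libraries falling outside the limits
totalReads[totalReads < limits["lower"] | totalReads > limits["upper"]]
# 13.0 -- the one unusually small library is flagged
```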
Percent aligned is simply the fraction of FASTQ reads for which there is a corresponding alignment in the TopHat BAM file. In other words, percentAligned = (# of aligned reads + all their duplicate reads, which were removed from the final BAM) / fastqTotalReads.
Percent aligned:
Percentage of aligned reads is well above the 80% cutoff for all libraries, with rates in the mid-90s across the board. lib7422 is outside the nominal outlier range, but still has 91.41% alignment.
Total reads:
While all libraries had well over 10 million reads in the input FASTQ file (after adapter trimming), lib7418 appears to be quite a bit smaller than the average of 24.27 million reads, with only 13.04 million reads.
Median CV coverage is calculated by Picard as the median coefficient of variation (standard deviation / mean) of per-base coverage across the 1000 most highly expressed transcripts. A high CV of read coverage suggests possible 5' or 3' bias, or potentially non-uniform ("bumpy" or "spikey") coverage of a transcript. If medianCVcoverage is high (> 1), this could indicate a more systemic problem with coverage in the dataset.
Median CV coverage:
All libraries look good (with medianCVcoverage generally close to 0.5) in terms of gene coverage among the top 1000 transcripts.
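To make the metric concrete, here is the coefficient of variation computed in base R for two hypothetical per-base coverage profiles (Picard computes this per transcript and reports the median over the top 1000; the coverage vectors below are invented):

```r
# Hypothetical per-base coverage along two transcripts
uniform <- c(48, 52, 50, 49, 51, 50)   # flat, even coverage
biased  <- c(5, 10, 20, 60, 120, 85)   # strong positional bias

cv <- function(x) sd(x) / mean(x)

cv(uniform)  # close to 0: even coverage
cv(biased)   # approaching 1: uneven coverage
```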
I plotted the (smoothed) frequency of log-normalized counts in each library, just to get a sense of the distribution.
Without any filtering, there’s a large spike at -1, representing genes with a count of 0, along with a smaller bump between 0 and 1, representing genes with very small counts. Notably, the distribution of counts for lib7418 is shifted dramatically to the left (which makes sense, given the much smaller number of pre-aligned reads).
I used the function defined above to build the DGEList object for the data, which is the input for downstream functions.
# Keep genes with cpm >= 10 in at least 20% of samples
dge <- setUpDGEList(countData = countDat, geneData = geneDat,
                    filterCount = 10,
                    filterPercentage = 0.20)
## keepRows
## FALSE TRUE
## 26239 11752
Keeping only those genes with >= 10 counts per million in at least 20% of samples (3.2 samples), we're left with 11752 genes.
While there are still a small fraction of genes with zero or very few counts, these no longer account for such a large percentage of genes across all libraries.
Correcting for library size, the distribution of counts per million (cpm) is more similar across all libraries, including lib7418.
To verify the effect of the change I made to Elizabeth's code above, I plotted norm.factors and effective library size (lib.size.eff) under two scenarios:

1. norm.factors are calculated before genes are filtered
2. norm.factors are calculated after genes are filtered and library sizes are updated

For the not-too-stringent threshold used to filter genes (CPM >= 10 in 20% of samples), the order of operations for calculating norm.factors appears to have minimal impact on effective library sizes.
I used the biomaRt package to add gene symbols (from MGI) corresponding to gene IDs from Ensembl.
# Get MGI gene symbols corresponding to Ensembl gene IDs
dgeGeneDat <- dge$genes
mart <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
ens2Gene <- getBM(attributes = c("ensembl_gene_id", "mgi_symbol"),
                  filters = "ensembl_gene_id",
                  values = dgeGeneDat$ensemblID, mart = mart)
ens2Gene <- ens2Gene[match(dgeGeneDat$ensemblID, ens2Gene$ensembl_gene_id), ]
# Insert MGI gene symbols for genes in DGE object gene info
dgeGeneDat <- dgeGeneDat %>%
    mutate(mgiSymbol = ens2Gene$mgi_symbol,
           mgiSymbol = ifelse(is.na(mgiSymbol), "NA", mgiSymbol))
dge$genes <- dgeGeneDat
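The `match()` step above is what lines the biomaRt results up row-for-row with the DGEList's gene table; a toy base-R illustration with invented IDs:

```r
# DGEList gene order (the order we need the annotation to follow)
dgeIDs <- c("ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C")

# biomaRt-style result: different row order, one ID missing entirely
bm <- data.frame(ensembl_gene_id = c("ENSMUSG_C", "ENSMUSG_A"),
                 mgi_symbol = c("GeneC", "GeneA"),
                 stringsAsFactors = FALSE)

# Reorder bm's rows to follow dgeIDs; unmatched IDs become NA rows
bmOrdered <- bm[match(dgeIDs, bm$ensembl_gene_id), ]
bmOrdered$mgi_symbol
# "GeneA" NA "GeneC" -- NA flags the ID biomaRt didn't return
```

The `NA` rows produced here are exactly what the `ifelse(is.na(...))` step above then converts to the string "NA".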
I performed PCA with the prcomp function and plotted the first two principal components using the autoplot function from the ggfortify package.
Looking just at the long-term HSC populations, WT and BCAP knockout mice seem to cluster separately. When short-term populations are included, there is less distinction between knockout groups.
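Since the PCA code itself isn't shown above, here is a minimal sketch of the approach on a toy log-expression matrix (random illustration data with a built-in two-group shift, not the study data; `prcomp` expects samples in rows, hence the transpose):

```r
set.seed(42)
# Toy log-CPM matrix: 100 genes x 8 samples, with a 2-group shift built in
logCPM <- matrix(rnorm(100 * 8), nrow = 100)
logCPM[1:20, 1:4] <- logCPM[1:20, 1:4] + 3  # first 4 samples differ

# Samples in rows for prcomp; center genes (scaling is optional)
pca <- prcomp(t(logCPM), center = TRUE, scale. = FALSE)

# Variance explained by the first two PCs
summary(pca)$importance["Proportion of Variance", 1:2]

# pca$x[, 1:2] (or autoplot(pca) from ggfortify) then gives the PC1-vs-PC2 plot
```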
Experimental, sequencing, and alignment variables plotted against the 1st principal component
Experimental, sequencing, and alignment variables plotted against the 2nd principal component
Surrogate variables were computed with sva from the voom-normalized expression values, using the model matrix for the ~ koStatus + hscPop design.
PC1 and 1st surrogate variable plotted against concentration (ng/ul)
koHscDesign <- model.matrix(~ koStatus + hscPop, data = groupDat)
koHscVoom <- voomWithQualityWeights(dge, design = koHscDesign,
                                    plot = TRUE)
koSva <- sva(koHscVoom$E, koHscVoom$design)
## Number of significant surrogate variables is: 3
## Iteration (out of 5 ):1 2 3 4 5
data_frame(sv1 = koSva$sv[, 1], pc1 = pca$x[, 1],
           conc = confoundDat$ng_per_ul) %>%
    melt(measure.vars = c("sv1", "pc1")) %>%
    ggplot(aes(x = conc, y = value)) +
    geom_point() +
    stat_smooth(method = "lm") +
    facet_wrap(~ variable, scales = "free_y")
Design 1: expression ~ koStatus
# Define the design matrix, only including terms corresponding to KO status;
# use voom to calculate transformed expression values
koDesign <- model.matrix(~ koStatus, data = groupDat)
koVoom <- voomWithQualityWeights(dge, design = koDesign,
                                 plot = TRUE)
# Fit model for the group design
koFit <- lmFit(koVoom, koDesign)
koFit <- eBayes(koFit)
koResults <- topTable(koFit, number = nrow(dge))
koResults %>%
    filter(mgiSymbol == "Pik3ap1")
## ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
## 1 ENSMUSG00000025017 Pik3ap1 1.1 5.9 2.6 0.019 0.34 -3.3
Design 2: expression ~ koStatus + hscPop
# Define the design matrix, including terms corresponding to KO status and HSC
# population; use voom to calculate transformed expression values
koHscDesign <- model.matrix(~ koStatus + hscPop, data = groupDat)
koHscVoom <- voomWithQualityWeights(dge, design = koHscDesign,
                                    plot = TRUE)
# Fit model for the group design
koHscFit <- lmFit(koHscVoom, koHscDesign)
koHscFit <- eBayes(koHscFit)
# coef = 2 extracts the koStatus term from the additive model
koHscResults <- topTable(koHscFit, coef = 2, number = nrow(dge))
koHscResults %>%
    filter(mgiSymbol == "Pik3ap1")
## ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
## 1 ENSMUSG00000025017 Pik3ap1 0.97 5.9 4.4 0.00038 0.087 0.14
Design 3: expression ~ koStatus * hscPop (i.e., koStatus + hscPop + koStatus:hscPop)
# Define the design matrix, including terms corresponding to KO status, HSC
# population, and the interaction between the two variables; use voom to
# calculate transformed expression values
koHscIntDesign <- model.matrix(~ koStatus * hscPop, data = groupDat)
koHscIntVoom <- voomWithQualityWeights(dge, design = koHscIntDesign,
                                       plot = TRUE)
# Fit model for the group design
koHscIntFit <- lmFit(koHscIntVoom, koHscIntDesign)
koHscIntFit <- eBayes(koHscIntFit)
# coef = 2 is the koStatus effect within the reference hscPop level
koHscIntResults <- topTable(koHscIntFit, coef = 2, number = nrow(dge))
koHscIntResults %>%
    filter(mgiSymbol == "Pik3ap1")
## ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
## 1 ENSMUSG00000025017 Pik3ap1 1.1 5.9 2.6 0.02 0.32 -3.2
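One note on interpreting Design 3: with R's default treatment contrasts, coef = 2 (koStatus) in the interaction model estimates the knockout effect within the reference hscPop level ("long", the first level alphabetically), not an overall knockout effect. The coefficient layout can be checked with base R's model.matrix() on toy grouping factors that mirror groupDat's coding (the data frame below is illustrative):

```r
# Toy grouping factors mirroring groupDat's lower-case coding
toyGroups <- data.frame(
  koStatus = factor(rep(c("bcap", "wt"), each = 4)),
  hscPop   = factor(rep(c("long", "short"), times = 4))
)

# Additive model: coef 2 is an overall koStatus effect
colnames(model.matrix(~ koStatus + hscPop, data = toyGroups))
# "(Intercept)" "koStatuswt" "hscPopshort"

# Interaction model: coef 2 is the koStatus effect at the reference
# hscPop level ("long"); the interaction term carries the difference
colnames(model.matrix(~ koStatus * hscPop, data = toyGroups))
# "(Intercept)" "koStatuswt" "hscPopshort" "koStatuswt:hscPopshort"
```

Consistent with this, the Design 3 estimate for Pik3ap1 is close to the long-term-only estimate reported below.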
Genes with significantly different expression (adj. p-value < 0.05) as a function of BCAP KO status:
# Get data for libraries from the long-term HSC population
groupDatLong <- groupDat %>%
    filter(hscPop == "long")
countDatLong <- countDat[, names(countDat) %in% groupDatLong$libID]
# Keep genes with cpm >= 10 in at least 20% of samples
dgeLong <- setUpDGEList(countData = countDatLong, geneData = geneDat,
                        filterCount = 10,
                        filterPercentage = 0.20)
## keepRows
## FALSE TRUE
## 26152 11839
Looking just at the long-term HSC populations, WT and BCAP knockout mice seem to cluster separately.
Experimental, sequencing, and alignment variables plotted against the 1st principal component
Design: expression ~ koStatus
# Define the design matrix, only including terms corresponding to KO status;
# use voom to calculate transformed expression values
koDesignLong <- model.matrix(~ koStatus, data = groupDatLong)
koVoomLong <- voomWithQualityWeights(dgeLong, design = koDesignLong,
                                     plot = TRUE)
# Fit model for the group design
koFitLong <- lmFit(koVoomLong, koDesignLong)
koFitLong <- eBayes(koFitLong)
koResultsLong <- topTable(koFitLong, number = nrow(dgeLong))
koResultsLong %>%
    filter(mgiSymbol == "Pik3ap1")
## ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
## 1 ENSMUSG00000025017 Pik3ap1 1.1 5.1 2.7 0.022 0.34 -3.1
koResultsLong %>% head()
## ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val
## 11177 ENSMUSG00000038418 Egr1 0.94 7.9 7.2 0.000022 0.13
## 18406 ENSMUSG00000064215 Ifi27 -0.81 7.1 -7.0 0.000030 0.13
## 4479 ENSMUSG00000024378 Stard4 -0.76 7.5 -6.3 0.000075 0.13
## 3785 ENSMUSG00000022565 Plec 0.88 7.5 5.9 0.000125 0.13
## 15467 ENSMUSG00000052369 Tmem106c 0.76 6.5 6.0 0.000119 0.13
## 34159 ENSMUSG00000089774 Slc5a3 1.47 5.9 6.1 0.000102 0.13
## B
## 11177 3.2
## 18406 2.8
## 4479 2.0
## 3785 1.5
## 15467 1.5
## 34159 1.3
# Get data for libraries from the short-term HSC population
groupDatShort <- groupDat %>%
    filter(hscPop == "short")
countDatShort <- countDat[, names(countDat) %in% groupDatShort$libID]
# Keep genes with cpm >= 10 in at least 20% of samples
dgeShort <- setUpDGEList(countData = countDatShort, geneData = geneDat,
                         filterCount = 10,
                         filterPercentage = 0.20)
## keepRows
## FALSE TRUE
## 26097 11894
Among the short-term HSC populations alone, there is less distinction between the knockout groups.
Experimental, sequencing, and alignment variables plotted against the 1st principal component
Design: expression ~ koStatus
# Define the design matrix, only including terms corresponding to KO status;
# use voom to calculate transformed expression values
koDesignShort <- model.matrix(~ koStatus, data = groupDatShort)
koVoomShort <- voomWithQualityWeights(dgeShort, design = koDesignShort,
                                      plot = TRUE)
# Fit model for the group design
koFitShort <- lmFit(koVoomShort, koDesignShort)
koFitShort <- eBayes(koFitShort)
koResultsShort <- topTable(koFitShort, number = nrow(dgeShort))
koResultsShort %>%
    filter(mgiSymbol == "Pik3ap1")
## ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
## 1 ENSMUSG00000025017 Pik3ap1 0.92 6.7 3.8 0.0038 0.66 -1.6
koResultsShort %>% head()
## ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val
## 9143 ENSMUSG00000032690 Oas2 -1.34 6.5 -5.7 0.00024 0.66
## 6486 ENSMUSG00000028037 Ifi44 -0.52 7.3 -5.5 0.00030 0.66
## 4854 ENSMUSG00000025035 Arl3 1.28 5.6 5.6 0.00026 0.66
## 21595 ENSMUSG00000073643 Wdfy1 -0.47 7.7 -5.1 0.00053 0.66
## 20176 ENSMUSG00000068394 Cep152 1.35 5.7 5.3 0.00039 0.66
## 15364 ENSMUSG00000051910 Sox6 -0.88 6.9 -4.9 0.00071 0.66
## B
## 9143 0.778
## 6486 0.711
## 4854 0.266
## 21595 0.145
## 20176 0.051
## 15364 -0.080